Importing required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Dataset: https://www.kaggle.com/datasets/danbraswell/us-tornado-dataset-1950-2021
This dataset covers tornadoes in the United States from 1950 through 2021.
From Kaggle:
Origin
This dataset was derived from a dataset produced by NOAA's Storm Prediction Center. The primary changes made to create this dataset were the deletion of some columns, change of some data types, and sorting by date.
Column Definitions
yr - 4-digit year
mo - Month (1-12)
dy - Day of month
date - Datetime object (e.g. 1950-01-01)
st - State where tornado originated; 2-letter abbreviation
mag - F rating thru Jan 2007; EF rating after Jan 2007 (-9 if unknown rating)
inj - Number of injuries
fat - Number of fatalities
slat - Starting latitude in decimal degrees
slon - Starting longitude in decimal degrees
elat - Ending latitude in decimal degrees (value of 0 if missing)
elon - Ending longitude in decimal degrees (value of 0 if missing)
len - Length of track in miles
wid - Width in yards
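The definitions above mention two sentinel encodings: -9 for an unknown rating and 0 for a missing end coordinate. Left as-is, those values would be treated as real numbers in averages and correlations. Here is a minimal sketch of masking them, assuming a tiny hand-made frame that reuses the dataset's column names (the rows and the `mask_sentinels` helper are hypothetical, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows using the dataset's column names (values are illustrative only)
sample = pd.DataFrame({
    "mag":  [3, -9, 1],
    "elat": [39.12, 0.0, 34.30],
    "elon": [-89.23, 0.0, -85.78],
})

def mask_sentinels(frame):
    """Return a copy with the documented sentinel values replaced by NaN."""
    out = frame.copy()
    out["mag"] = out["mag"].replace(-9, np.nan)                       # -9 = unknown rating
    out[["elat", "elon"]] = out[["elat", "elon"]].replace(0, np.nan)  # 0 = missing end point
    return out

cleaned = mask_sentinels(sample)
print(cleaned)
```

The helper works on a copy, so the original frame keeps its raw values if they are needed later.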
My goal is to identify which states and which months are most likely to see damaging tornadoes.
Importing and Previewing Data
df = pd.read_csv('US_Tornados_1950_2021.csv')
df.head()
| | yr | mo | dy | date | st | mag | inj | fat | slat | slon | elat | elon | len | wid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1950 | 1 | 3 | 1950-01-03 | IL | 3 | 3 | 0 | 39.10 | -89.30 | 39.12 | -89.23 | 3.6 | 130 |
| 1 | 1950 | 1 | 3 | 1950-01-03 | MO | 3 | 3 | 0 | 38.77 | -90.22 | 38.83 | -90.03 | 9.5 | 150 |
| 2 | 1950 | 1 | 3 | 1950-01-03 | OH | 1 | 1 | 0 | 40.88 | -84.58 | 0.00 | 0.00 | 0.1 | 10 |
| 3 | 1950 | 1 | 13 | 1950-01-13 | AR | 3 | 1 | 1 | 34.40 | -94.37 | 0.00 | 0.00 | 0.6 | 17 |
| 4 | 1950 | 1 | 25 | 1950-01-25 | IL | 2 | 0 | 0 | 41.17 | -87.33 | 0.00 | 0.00 | 0.1 | 100 |
Creating a correlation matrix to evaluate any relationships
sns.heatmap(df.select_dtypes('number').corr())  # correlate numeric columns only; 'date' and 'st' are strings
<AxesSubplot:>
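A heatmap without annotations is hard to read precisely, so it can help to list the correlations with one column of interest as a sorted series. This is a rough sketch using only the five preview rows shown by `df.head()` above, so the numbers it prints say nothing about the full dataset. The names `sample` and `fat_corr` are mine.

```python
import pandas as pd

# The inj/fat/len/wid values of the five rows previewed by df.head()
sample = pd.DataFrame({
    "st":  ["IL", "MO", "OH", "AR", "IL"],
    "inj": [3, 3, 1, 1, 0],
    "fat": [0, 0, 0, 1, 0],
    "len": [3.6, 9.5, 0.1, 0.6, 0.1],
    "wid": [130, 150, 10, 17, 100],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = sample.select_dtypes("number").corr()

# Correlation of every other column with fatalities, strongest (by absolute value) first
fat_corr = corr["fat"].drop("fat").sort_values(key=abs, ascending=False)
print(fat_corr)
```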
df.shape
(67558, 14)
df.isnull().sum()
yr      0
mo      0
dy      0
date    0
st      0
mag     0
inj     0
fat     0
slat    0
slon    0
elat    0
elon    0
len     0
wid     0
dtype: int64
df.dtypes
yr        int64
mo        int64
dy        int64
date     object
st       object
mag       int64
inj       int64
fat       int64
slat    float64
slon    float64
elat    float64
elon    float64
len     float64
wid       int64
dtype: object
df.nunique()
yr         72
mo         12
dy         31
date    12300
st         53
mag         7
inj       209
fat        50
slat    14215
slon    16024
elat    15043
elon    16571
len      2429
wid       405
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67558 entries, 0 to 67557
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   yr      67558 non-null  int64
 1   mo      67558 non-null  int64
 2   dy      67558 non-null  int64
 3   date    67558 non-null  object
 4   st      67558 non-null  object
 5   mag     67558 non-null  int64
 6   inj     67558 non-null  int64
 7   fat     67558 non-null  int64
 8   slat    67558 non-null  float64
 9   slon    67558 non-null  float64
 10  elat    67558 non-null  float64
 11  elon    67558 non-null  float64
 12  len     67558 non-null  float64
 13  wid     67558 non-null  int64
dtypes: float64(5), int64(7), object(2)
memory usage: 7.2+ MB
Converting to String in preparation for Month names
df['mo'] = df['mo'].astype(str)
df.dtypes
yr        int64
mo       object
dy        int64
date     object
st       object
mag       int64
inj       int64
fat       int64
slat    float64
slon    float64
elat    float64
elon    float64
len     float64
wid       int64
dtype: object
Converting to Month names for visuals
df['mo'] = df['mo'].replace(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'], ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'])
df
| | yr | mo | dy | date | st | mag | inj | fat | slat | slon | elat | elon | len | wid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1950 | January | 3 | 1950-01-03 | IL | 3 | 3 | 0 | 39.1000 | -89.3000 | 39.1200 | -89.2300 | 3.60 | 130 |
| 1 | 1950 | January | 3 | 1950-01-03 | MO | 3 | 3 | 0 | 38.7700 | -90.2200 | 38.8300 | -90.0300 | 9.50 | 150 |
| 2 | 1950 | January | 3 | 1950-01-03 | OH | 1 | 1 | 0 | 40.8800 | -84.5800 | 0.0000 | 0.0000 | 0.10 | 10 |
| 3 | 1950 | January | 13 | 1950-01-13 | AR | 3 | 1 | 1 | 34.4000 | -94.3700 | 0.0000 | 0.0000 | 0.60 | 17 |
| 4 | 1950 | January | 25 | 1950-01-25 | IL | 2 | 0 | 0 | 41.1700 | -87.3300 | 0.0000 | 0.0000 | 0.10 | 100 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67553 | 2021 | December | 30 | 2021-12-30 | GA | 1 | 0 | 0 | 31.1703 | -83.3804 | 31.1805 | -83.3453 | 2.19 | 150 |
| 67554 | 2021 | December | 30 | 2021-12-30 | GA | 1 | 0 | 0 | 31.6900 | -82.7300 | 31.7439 | -82.5412 | 11.71 | 300 |
| 67555 | 2021 | December | 31 | 2021-12-31 | AL | 1 | 0 | 0 | 34.2875 | -85.7878 | 34.2998 | -85.7805 | 0.95 | 50 |
| 67556 | 2021 | December | 31 | 2021-12-31 | GA | 1 | 0 | 0 | 33.7372 | -84.9998 | 33.7625 | -84.9633 | 2.75 | 150 |
| 67557 | 2021 | December | 31 | 2021-12-31 | GA | 1 | 6 | 0 | 33.5676 | -83.9877 | 33.5842 | -83.9498 | 2.50 | 75 |
67558 rows × 14 columns
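The two-step conversion above (cast to string, then replace) leaves `mo` as a plain string column, so anything that sorts by month will sort alphabetically. One possible alternative, sketched here on made-up month numbers, is to map the integers directly and store the result as an ordered categorical. The `mo_name` column and the sample values are hypothetical, and `calendar.month_name` returns English names only under the default locale.

```python
import calendar
import pandas as pd

# ['January', ..., 'December'] (index 0 of calendar.month_name is an empty string)
month_names = list(calendar.month_name)[1:]

sample = pd.DataFrame({"mo": [12, 1, 5, 5, 4]})  # hypothetical month numbers

# Map 1-12 to names and keep calendar order by using an ordered categorical
sample["mo_name"] = pd.Categorical(
    sample["mo"].map(lambda m: month_names[m - 1]),
    categories=month_names,
    ordered=True,
)

# Grouping on the categorical returns months in calendar order, not alphabetical order
counts = sample.groupby("mo_name", observed=True).size()
print(counts)
```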
sns.set(rc={"figure.figsize":(15, 8)})
sns.barplot(data=df, x='mo', y='fat').set(title='Fatalities per month')
[Text(0.5, 1.0, 'Fatalities per month')]
sns.barplot(data=df, x='mo', y='inj').set(title='Injuries per month')
[Text(0.5, 1.0, 'Injuries per month')]
The injuries plot shows a similar monthly pattern to the fatalities plot.
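One caveat about the two bar plots: `sns.barplot` aggregates `y` with the mean by default, so the bars show average fatalities (or injuries) per tornado in each month, not monthly totals. To compare totals, one option is to aggregate explicitly first. A small sketch with made-up rows (the `sample` and `per_month` names are mine):

```python
import pandas as pd

# Hypothetical rows: one row per tornado, as in the real dataset
sample = pd.DataFrame({
    "mo":  ["January", "January", "February", "February", "February"],
    "fat": [0, 4, 1, 1, 1],
})

# Total, mean, and number of tornadoes per month, side by side
per_month = sample.groupby("mo", sort=False)["fat"].agg(
    total="sum", mean="mean", tornadoes="count"
)
print(per_month)
```

Here January has the higher mean but only two tornadoes, which is exactly the kind of difference a mean-based bar plot hides. The `total` column could be passed to a bar plot to show totals instead.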
stData = df.groupby('st').sum(numeric_only=True)  # sum numeric columns only; skips the 'mo' and 'date' strings
stData.head()
| st | yr | dy | mag | inj | fat | slat | slon | elat | elon | len | wid |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AK | 7972 | 61 | 0 | 0 | 0 | 236.9000 | -630.5000 | 236.9000 | -630.5000 | 0.20 | 10 |
| AL | 4706753 | 38632 | 2381 | 8672 | 665 | 77548.7157 | -204722.1012 | 59192.3210 | -155354.7776 | 13047.41 | 412780 |
| AR | 3808313 | 29740 | 2119 | 5408 | 400 | 66753.3221 | -176617.5860 | 48549.2136 | -128049.1970 | 11356.39 | 348494 |
| AZ | 537125 | 4341 | 104 | 152 | 3 | 9148.7256 | -30158.2005 | 4222.5598 | -13712.7914 | 527.35 | 19709 |
| CA | 921157 | 7223 | 159 | 90 | 0 | 16787.2859 | -55287.0377 | 9499.5146 | -31056.3938 | 520.80 | 19524 |
stData['fat'].value_counts()[:10]
0      11
1       4
4       4
2       2
194     2
438     1
28      1
60      1
31      1
226     1
Name: fat, dtype: int64
# Ten largest values in column fat
deadliestST = stData.nlargest(10, ['fat'])
deadliestST.head()
| st | yr | dy | mag | inj | fat | slat | slon | elat | elon | len | wid |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AL | 4706753 | 38632 | 2381 | 8672 | 665 | 77548.7157 | -204722.1012 | 59192.3210 | -155354.7776 | 13047.41 | 412780 |
| TX | 18188289 | 149133 | 5096 | 9412 | 591 | 293386.8538 | -901295.0097 | 147586.7371 | -451737.2735 | 22182.26 | 846104 |
| MS | 4940920 | 40654 | 2587 | 6452 | 476 | 80361.5682 | -221997.9981 | 59312.3935 | -163382.8908 | 14794.95 | 451986 |
| OK | 8133230 | 65898 | 2641 | 5997 | 438 | 145490.8846 | -398575.9309 | 85184.3134 | -232743.2105 | 15508.93 | 593659 |
| TN | 2656272 | 20272 | 1412 | 4936 | 407 | 47652.9264 | -115638.7184 | 35960.8603 | -86890.5627 | 6690.46 | 212440 |
deadliestST.reset_index(inplace=True)
deadliestST.columns
Index(['st', 'yr', 'dy', 'mag', 'inj', 'fat', 'slat', 'slon', 'elat', 'elon',
'len', 'wid'],
dtype='object')
sns.barplot(data=deadliestST, x='st', y='fat').set(title='Fatalities in the top 10 deadliest states')
[Text(0.5, 1.0, 'Fatalities in the top 10 deadliest states')]
sns.barplot(data=deadliestST, x='st', y='inj').set(title='Injuries in the top 10 deadliest states')
[Text(0.5, 1.0, 'Injuries in the top 10 deadliest states')]
import plotly.express as px
import plotly.graph_objects as go
fig = px.choropleth(
    deadliestST,
    locations='st',
    locationmode="USA-states",
    scope="usa",
    color='fat',
    color_continuous_scale="YlOrRd",
)
fig.update_layout(title_text='Top ten states by tornado fatalities, 1950-2021')
fig.show()
stMap = stData.reset_index()
stMap.columns
Index(['st', 'yr', 'dy', 'mag', 'inj', 'fat', 'slat', 'slon', 'elat', 'elon',
'len', 'wid'],
dtype='object')
fig = px.choropleth(
    stMap,
    locations='st',
    locationmode="USA-states",
    scope="usa",
    color='fat',
    color_continuous_scale="YlOrRd",
)
fig.update_layout(title_text='Tornado fatalities per state, 1950-2021')
fig.show()
df.head()
| | yr | mo | dy | date | st | mag | inj | fat | slat | slon | elat | elon | len | wid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1950 | January | 3 | 1950-01-03 | IL | 3 | 3 | 0 | 39.10 | -89.30 | 39.12 | -89.23 | 3.6 | 130 |
| 1 | 1950 | January | 3 | 1950-01-03 | MO | 3 | 3 | 0 | 38.77 | -90.22 | 38.83 | -90.03 | 9.5 | 150 |
| 2 | 1950 | January | 3 | 1950-01-03 | OH | 1 | 1 | 0 | 40.88 | -84.58 | 0.00 | 0.00 | 0.1 | 10 |
| 3 | 1950 | January | 13 | 1950-01-13 | AR | 3 | 1 | 1 | 34.40 | -94.37 | 0.00 | 0.00 | 0.6 | 17 |
| 4 | 1950 | January | 25 | 1950-01-25 | IL | 2 | 0 | 0 | 41.17 | -87.33 | 0.00 | 0.00 | 0.1 | 100 |
df.dtypes
yr        int64
mo       object
dy        int64
date     object
st       object
mag       int64
inj       int64
fat       int64
slat    float64
slon    float64
elat    float64
elon    float64
len     float64
wid       int64
dtype: object
stCounts = df.value_counts('st').reset_index()
stCounts.head()
| | st | 0 |
|---|---|---|
| 0 | TX | 9149 |
| 1 | KS | 4375 |
| 2 | OK | 4092 |
| 3 | FL | 3497 |
| 4 | NE | 2967 |
stCounts.columns
Index(['st', 0], dtype='object')
stCounts.columns = ['st', 'Count']  # name the count column explicitly; its default name differs across pandas versions
stCounts.dtypes
st object Count int64 dtype: object
#sns.barplot(data=stCounts, x='Count', y='st').set(title='Tornado counts per state')
torCounts = stCounts['Count']
states = stCounts['st']
fig = plt.figure(figsize = (20,10))
plt.bar(states, torCounts)
plt.title('Tornado counts per state')
plt.ylabel('Count')
plt.xlabel('State')
plt.show()
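Raw counts and raw fatality totals both favor states that simply see a lot of tornadoes. Since the goal is to find where damaging tornadoes are most likely, one possible follow-up is to look at fatalities per tornado. This is only a sketch on invented rows. The `fat_per_tornado` column is my own naming, and a per-tornado rate is one of several reasonable normalizations.

```python
import pandas as pd

# Hypothetical rows: one row per tornado
sample = pd.DataFrame({
    "st":  ["TX", "TX", "TX", "AL", "AL"],
    "fat": [0, 1, 0, 5, 1],
})

# Number of tornadoes and total fatalities per state
by_state = sample.groupby("st")["fat"].agg(tornadoes="count", fatalities="sum")

# Average fatalities per tornado, highest first
by_state["fat_per_tornado"] = by_state["fatalities"] / by_state["tornadoes"]
ranked = by_state.sort_values("fat_per_tornado", ascending=False)
print(ranked)
```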
moCounts = df.value_counts("mo")  # Series of tornado counts indexed by month name
moCounts.head(12)
mo
May          14818
June         12492
April         9573
July          6971
August        4788
March         4514
September     3471
October       2802
November      2647
February      1945
December      1818
January       1719
dtype: int64
sns.lineplot(data=moCounts).set(title='Tornado instances per month')
[Text(0.5, 1.0, 'Tornado instances per month')]
TX = df[df['st'] == 'TX']  # exact match on the state code
TX.head()
| | yr | mo | dy | date | st | mag | inj | fat | slat | slon | elat | elon | len | wid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 1950 | January | 26 | 1950-01-26 | TX | 2 | 2 | 0 | 26.88 | -98.12 | 26.88 | -98.05 | 4.7 | 133 |
| 7 | 1950 | February | 11 | 1950-02-11 | TX | 2 | 0 | 0 | 29.42 | -95.25 | 29.52 | -95.13 | 9.9 | 400 |
| 8 | 1950 | February | 11 | 1950-02-11 | TX | 2 | 5 | 0 | 32.35 | -95.20 | 32.42 | -95.20 | 4.6 | 100 |
| 9 | 1950 | February | 11 | 1950-02-11 | TX | 2 | 6 | 0 | 32.98 | -94.63 | 33.00 | -94.70 | 4.5 | 67 |
| 10 | 1950 | February | 11 | 1950-02-11 | TX | 3 | 12 | 1 | 29.67 | -95.05 | 29.83 | -95.00 | 12.0 | 1000 |
txCounts = TX.value_counts("mo")  # count Texas tornadoes per month (TX subset, not the full df)
txCounts.head(12)
txMoCounts = df.groupby('mo').sum(numeric_only=True)  # groups the full df, so these are totals for all states
txMoCounts.head(12)
| mo | yr | dy | mag | inj | fat | slat | slon | elat | elon | len | wid |
|---|---|---|---|---|---|---|---|---|---|---|---|
| April | 19070543 | 161950 | 8750 | 30103 | 1863 | 342892.5693 | -8.786775e+05 | 234770.5950 | -596096.1246 | 49238.74 | 1440833 |
| August | 9533227 | 75922 | 2284 | 2967 | 128 | 185718.8583 | -4.389518e+05 | 101695.6803 | -235473.7705 | 9870.59 | 348604 |
| December | 3628253 | 29335 | 1818 | 4470 | 280 | 62386.1304 | -1.656981e+05 | 46658.6122 | -122608.2432 | 9370.05 | 230605 |
| February | 3874228 | 31893 | 1989 | 6550 | 471 | 65147.9998 | -1.746559e+05 | 43713.6637 | -115702.1465 | 9678.83 | 260687 |
| January | 3431080 | 27620 | 1550 | 2864 | 171 | 57136.2030 | -1.548184e+05 | 42466.7059 | -114081.6532 | 7231.04 | 231177 |
| July | 13867895 | 103018 | 3547 | 2304 | 73 | 280468.3024 | -6.462901e+05 | 142898.2403 | -321788.8878 | 13303.95 | 502237 |
| June | 24847824 | 181156 | 7400 | 9636 | 569 | 492467.7621 | -1.184246e+06 | 265686.0307 | -626006.8624 | 31986.68 | 964279 |
| March | 8990220 | 77662 | 4217 | 10732 | 776 | 157129.9508 | -4.117682e+05 | 106167.0159 | -276270.5484 | 23736.07 | 674297 |
| May | 29507277 | 247811 | 8433 | 17892 | 1313 | 550775.5595 | -1.405238e+06 | 341478.0055 | -860179.7299 | 49667.61 | 1640330 |
| November | 5275431 | 42592 | 2702 | 5325 | 269 | 91334.2088 | -2.377734e+05 | 62655.3987 | -162300.4312 | 13168.46 | 331375 |
| October | 5592567 | 45240 | 1874 | 2468 | 102 | 98997.3655 | -2.563770e+05 | 70688.1423 | -181269.1689 | 8816.60 | 307695 |
| September | 6912512 | 51393 | 2137 | 1829 | 97 | 124812.1600 | -3.138490e+05 | 76762.2293 | -188062.0112 | 8921.07 | 268012 |
sns.lineplot(data=txMoCounts, x="mo", y="fat").set(title='Tornado fatalities per month (all states)')
[Text(0.5, 1.0, 'Tornado fatalities per month (all states)')]
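The monthly table above is grouped from the full `df`, and its index is sorted alphabetically (April, August, December, ...), which makes a line plot across months hard to read. Below is a rough sketch of a Texas-only version in calendar order, using invented rows. `MONTHS`, `tx`, and `tx_fat` are hypothetical names, and the numbers are not from the dataset.

```python
import pandas as pd

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical rows: one row per tornado
sample = pd.DataFrame({
    "st":  ["TX", "TX", "OK", "TX", "AL"],
    "mo":  ["May", "April", "May", "May", "April"],
    "fat": [2, 1, 7, 0, 3],
})

# Restrict to Texas first, then total fatalities per month
tx = sample[sample["st"] == "TX"]
tx_fat = tx.groupby("mo")["fat"].sum()

# Put months in calendar order; months with no Texas rows become 0
tx_fat = tx_fat.reindex(MONTHS, fill_value=0)
print(tx_fat)
```

The resulting series can be plotted directly, and the x-axis then follows the calendar.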
tornadoMag = df[df.mag != -9]
tornadoMag.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 66953 entries, 0 to 67557
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   yr      66953 non-null  int64
 1   mo      66953 non-null  object
 2   dy      66953 non-null  int64
 3   date    66953 non-null  object
 4   st      66953 non-null  object
 5   mag     66953 non-null  int64
 6   inj     66953 non-null  int64
 7   fat     66953 non-null  int64
 8   slat    66953 non-null  float64
 9   slon    66953 non-null  float64
 10  elat    66953 non-null  float64
 11  elon    66953 non-null  float64
 12  len     66953 non-null  float64
 13  wid     66953 non-null  int64
dtypes: float64(5), int64(6), object(3)
memory usage: 7.7+ MB
tornadoMag = tornadoMag[tornadoMag['mag'] >= 3]  # keep F3/EF3 and stronger; reassigning avoids SettingWithCopyWarning
tornadoMag.head()
| | yr | mo | dy | date | st | mag | inj | fat | slat | slon | elat | elon | len | wid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1950 | January | 3 | 1950-01-03 | IL | 3 | 3 | 0 | 39.10 | -89.30 | 39.12 | -89.23 | 3.6 | 130 |
| 1 | 1950 | January | 3 | 1950-01-03 | MO | 3 | 3 | 0 | 38.77 | -90.22 | 38.83 | -90.03 | 9.5 | 150 |
| 3 | 1950 | January | 13 | 1950-01-13 | AR | 3 | 1 | 1 | 34.40 | -94.37 | 0.00 | 0.00 | 0.6 | 17 |
| 10 | 1950 | February | 11 | 1950-02-11 | TX | 3 | 12 | 1 | 29.67 | -95.05 | 29.83 | -95.00 | 12.0 | 1000 |
| 15 | 1950 | February | 12 | 1950-02-12 | LA | 3 | 25 | 5 | 31.63 | -93.65 | 32.55 | -93.03 | 74.5 | 100 |
tornadoMag.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3176 entries, 0 to 67394
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   yr      3176 non-null   int64
 1   mo      3176 non-null   object
 2   dy      3176 non-null   int64
 3   date    3176 non-null   object
 4   st      3176 non-null   object
 5   mag     3176 non-null   int64
 6   inj     3176 non-null   int64
 7   fat     3176 non-null   int64
 8   slat    3176 non-null   float64
 9   slon    3176 non-null   float64
 10  elat    3176 non-null   float64
 11  elon    3176 non-null   float64
 12  len     3176 non-null   float64
 13  wid     3176 non-null   int64
dtypes: float64(5), int64(6), object(3)
memory usage: 372.2+ KB
fig = go.Figure(data=go.Scattergeo(
    lon=tornadoMag['slon'],
    lat=tornadoMag['slat'],
    text=tornadoMag['yr'],
    mode='markers',
    marker_color=tornadoMag['mag'],
))
fig.update_layout(
    title='F3 or greater Tornado touchdown points',
    geo_scope='usa',
)
fig.show()
tornadoMag = tornadoMag[tornadoMag['mag'] >= 5]  # keep only F5/EF5; reassigning avoids SettingWithCopyWarning
fig = go.Figure(data=go.Scattergeo(
    lon=tornadoMag['slon'],
    lat=tornadoMag['slat'],
    text=tornadoMag['yr'],
    mode='markers',
    marker_color=tornadoMag['mag'],
))
fig.update_layout(
    title='F5 Tornado touchdown points',
    geo_scope='usa',
)
fig.show()
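The two maps above narrow the same frame in place, first to F3 and stronger and then to F5, so the F3 subset is gone once the second filter runs. A small helper that returns a fresh copy for any threshold might be easier to reuse. This is a sketch under my own naming (`at_least` is hypothetical), shown on the `mag` values of a few rows from the previews above plus one made-up unknown rating.

```python
import pandas as pd

# 'mag' values of a few rows from the previews above, plus one hypothetical -9 (unknown)
sample = pd.DataFrame({
    "st":  ["IL", "MO", "OH", "AR", "IL", "TX"],
    "mag": [3, 3, 1, 3, 2, -9],
})

def at_least(frame, min_mag):
    """Return an independent copy of the rows rated min_mag or higher, excluding unknown (-9) ratings."""
    known = frame["mag"] != -9
    return frame[known & (frame["mag"] >= min_mag)].copy()

f3_plus = at_least(sample, 3)
f5_only = at_least(sample, 5)
print(len(f3_plus), len(f5_only))
```

Because each call returns a copy, both subsets stay available and neither triggers a SettingWithCopyWarning if modified later.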